Statistics for Data Science II
We have previously discussed continuous outcomes and the normal distribution.
Let’s now consider categorical outcomes:
Binary
Ordinal
Multinomial
\ln \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k,
where \pi = \text{P}[Y = 1] = the probability of the outcome/event.
How is this different from linear regression?
y = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k
library(tidyverse)
richmondway <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-09-26/richmondway.csv') %>%
  mutate(dating = if_else(Dating_flag == "Yes", 1, 0),
         IMDB = if_else(Imdb_rating >= 8.5, 1, 0)) %>%
  select(Season, Episode, F_count_RK, F_perc, dating, IMDB)
# quantile(richmondway$Imdb_rating, c(0, 0.25, 0.5, 0.75, 1))
# richmondway %>% count(IMDB)
m1 <- glm(dating ~ F_perc + IMDB,
          data = richmondway,
          family = binomial(link = "logit"))
summary(m1)
Call:
glm(formula = dating ~ F_perc + IMDB, family = binomial(link = "logit"),
data = richmondway)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.76166 1.08995 -1.616 0.106
F_perc 0.03323 0.02506 1.326 0.185
IMDB 0.37986 0.72250 0.526 0.599
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.662 on 33 degrees of freedom
Residual deviance: 44.261 on 31 degrees of freedom
AIC: 50.261
Number of Fisher Scoring iterations: 4
\ln \left( \frac{\hat{\pi}}{1-\hat{\pi}} \right) = -1.76 + 0.03 x_1 + 0.38 x_2,
where
x_1 is the percentage of the episode’s F-bombs that came from Roy Kent
x_2 is the IMDB rating categorization of the episode
\ln \left( \frac{\pi}{1-\pi} \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k,
We are modeling the log odds, which are not intuitive to interpret.
To be able to discuss the odds, we will “undo” the natural log by exponentiation.
i.e., if we want to interpret the slope for x_i, we will look at e^{\hat{\beta}_i}.
When interpreting \hat{\beta}_i, it is an additive effect on the log odds.
When interpreting e^{\hat{\beta}_i}, it is a multiplicative effect on the odds.
\begin{align*} \ln \left( \frac{\pi}{1-\pi} \right) &= \beta_0 + \beta_1 x_1 + ... + \beta_k x_k \\ \exp\left\{ \ln \left( \frac{\pi}{1-\pi} \right) \right\} &= \exp\left\{ \beta_0 + \beta_1 x_1 + ... + \beta_k x_k \right\} \\ \frac{\pi}{1-\pi} &= e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_k x_k} \end{align*}
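The derivation above says that a slope adds on the log-odds scale but multiplies on the odds scale. A quick numeric check of that point, using a hypothetical slope of 0.5:

```r
# Hypothetical slope and baseline log odds (illustration only)
beta <- 0.5
log_odds_x0 <- -1                    # log odds at some value of x
log_odds_x1 <- log_odds_x0 + beta    # additive effect on the log odds

odds_x0 <- exp(log_odds_x0)
odds_x1 <- exp(log_odds_x1)
odds_x1 / odds_x0                    # multiplicative effect: equals exp(beta)
```

The ratio of the two odds is exactly e^{\beta}, which is why we interpret e^{\hat{\beta}_i} as a multiplicative effect.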
For continuous predictors: e^{\hat{\beta}_i} is the multiplicative change in the odds for a one-unit increase in x_i, holding the other predictors constant.
For categorical predictors: e^{\hat{\beta}_i} is the odds ratio comparing that category to the reference category, holding the other predictors constant.
Let’s interpret the odds ratios:
For a 1 percentage point increase in the percentage of f-bombs that came from Roy Kent, the odds of Roy and Keeley dating increase by 3%.
Compared to episodes with an IMDB rating below 8.5, the odds of Roy and Keeley dating are 46% higher in episodes with an IMDB rating of at least 8.5.
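These odds ratios come from exponentiating the estimates in the summary() output. A quick check, with the coefficients copied from above:

```r
# Estimates copied from summary(m1) above
b <- c(`(Intercept)` = -1.76166, F_perc = 0.03323, IMDB = 0.37986)

# Odds ratios: F_perc ~1.03 (about a 3% increase in odds per
# percentage point); IMDB ~1.46 (about 46% higher odds)
or <- exp(b)
round(or, 3)
```

In practice, exp(coef(m1)) gives the same values directly from the fitted model object.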
summary():
Call:
glm(formula = dating ~ F_perc + IMDB, family = binomial(link = "logit"),
data = richmondway)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.76166 1.08995 -1.616 0.106
F_perc 0.03323 0.02506 1.326 0.185
IMDB 0.37986 0.72250 0.526 0.599
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 46.662 on 33 degrees of freedom
Residual deviance: 44.261 on 31 degrees of freedom
AIC: 50.261
Number of Fisher Scoring iterations: 4
What we’ve learned so far about assessing the significance of predictors still holds true with logistic regression.
The guidelines we’ve set up for data visualization still hold true.
We will put our outcome on the y-axis and a continuous (or at least ordinal) predictor on the x-axis.
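As a sketch of that layout, with simulated data standing in for the real outcome (the actual plot would use dating vs. F_perc from richmondway):

```r
library(ggplot2)

# Hypothetical data: binary outcome y vs. a continuous predictor x
set.seed(1)
d <- data.frame(x = runif(60, 0, 100))
d$y <- rbinom(60, 1, plogis(-2 + 0.04 * d$x))

# Outcome on the y-axis, continuous predictor on the x-axis,
# with the fitted logistic curve overlaid
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"),
              se = FALSE) +
  labs(x = "continuous predictor", y = "outcome (0/1)")
```

geom_smooth() with method = "glm" and a binomial family draws the estimated probability curve rather than a straight line.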
Recall the logistic regression model, \ln \left( \frac{\pi_i}{1-\pi_i} \right) = \beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki}
We can solve for the probability, which allows us to predict the probability that y_i=1 given the specified model: \pi_i = \frac{\exp\left\{ \beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki} \right\}}{1 + \exp\left\{ \beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki} \right\}}
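For example, plugging hypothetical predictor values (30% of F-bombs from Roy, an IMDB rating of at least 8.5) into that formula with the estimates from above:

```r
# Estimates copied from summary(m1) above
b0 <- -1.76166; b1 <- 0.03323; b2 <- 0.37986

# Hypothetical values: x1 = 30 (percent), x2 = 1 (rating >= 8.5)
eta <- b0 + b1 * 30 + b2 * 1          # linear predictor (log odds)
pi_hat <- exp(eta) / (1 + exp(eta))   # predicted probability
pi_hat                                # equivalently, plogis(eta)
```

The same prediction comes from predict(m1, newdata = ..., type = "response"), which applies this inverse-logit transformation for us.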